Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing
Audio anti-spoofing for automatic speaker verification aims to safeguard
users' identities from spoofing attacks. Although state-of-the-art spoofing
countermeasure (CM) models perform well on specific datasets, they lack
generalization when evaluated with different datasets. To address this
limitation, previous studies have explored large pre-trained models, which
require significant resources and time. We aim to develop a compact but
well-generalizing CM model that can compete with large pre-trained models. Our
approach involves multi-dataset co-training and sharpness-aware minimization,
which has not been investigated in this domain. Extensive experiments reveal
that the proposed method yields competitive results across various datasets
while using 4,000 times fewer parameters than the large pre-trained models.
Comment: Interspeech 202
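The sharpness-aware minimization mentioned above can be illustrated with a minimal NumPy sketch on a toy quadratic loss. This is not the paper's training setup; the loss, learning rate, and radius `rho` are illustrative assumptions, and the two-step structure (ascend to a worst-case perturbation, then descend using the gradient computed there) is the core of SAM.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss; stands in for the CM model's training loss.
    return 0.5 * np.sum(w ** 2)

def grad(w):
    # Gradient of 0.5 * ||w||^2 is simply w.
    return w

def sam_step(w, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) update.

    SAM first perturbs the weights toward the worst case within an
    L2 ball of radius rho, then applies the gradient computed at the
    perturbed point to the original weights, biasing training toward
    flat minima.
    """
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_adv = grad(w + eps)                        # gradient at perturbed point
    return w - lr * g_adv                        # descend from original w

w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w)
```

In a real training loop the two gradient evaluations would each be a forward/backward pass over a mini-batch, roughly doubling the per-step cost.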
Selective Kernel Attention for Robust Speaker Verification
Recent state-of-the-art speaker verification architectures adopt multi-scale
processing and frequency-channel attention techniques. However, their full
potential may not have been exploited, because these techniques' receptive
fields are fixed: most convolutional layers operate with pre-specified kernel
sizes such as 1, 3, or 5. We aim to further improve this line of research by
introducing a selective kernel attention (SKA) mechanism. The SKA mechanism
allows each convolutional layer to adaptively select the kernel size in a
data-driven fashion, based on an attention mechanism that exploits both the
frequency and channel domains using the previous layer's output. We propose
three module variants using the SKA mechanism whereby two modules are applied
in front of an ECAPA-TDNN model, and the other is combined with the Res2Net
backbone block. Experimental results demonstrate that our proposed model
consistently outperforms the conventional counterpart on the three different
evaluation protocols in terms of both equal error rate and minimum detection
cost function. In addition, we present a detailed analysis that helps
understand how the SKA module works.
Comment: Submitted to INTERSPEECH 2022. 5 pages, 3 figures, 1 table
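The adaptive kernel selection described above can be sketched in a stripped-down, single-channel form. This is an assumption-laden illustration, not the paper's SKA module (which also attends over frequency and channel): two branches with different kernel sizes are combined through softmax attention weights derived from their globally pooled outputs.

```python
import numpy as np

def conv1d_same(x, kernel):
    # 1-D convolution with zero padding so output length matches input.
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel)
                     for i in range(len(x))])

def selective_kernel(x, k3, k5):
    """Combine two kernel branches with data-driven attention weights.

    Minimal sketch of the selective-kernel idea: each branch uses a
    different kernel size, a global descriptor scores each branch,
    and a softmax over the scores decides how much each kernel size
    contributes to the output.
    """
    u3 = conv1d_same(x, k3)                # branch with kernel size 3
    u5 = conv1d_same(x, k5)                # branch with kernel size 5
    s = np.array([u3.mean(), u5.mean()])   # global descriptor per branch
    a = np.exp(s) / np.exp(s).sum()        # softmax attention weights
    return a[0] * u3 + a[1] * u5

x = np.linspace(0.0, 1.0, 8)
y = selective_kernel(x, np.ones(3) / 3, np.ones(5) / 5)
```

Because the attention weights depend on the input itself, the effective receptive field varies per utterance rather than being fixed at design time.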
Capturing scattered discriminative information using a deep architecture in acoustic scene classification
Frequently misclassified pairs of classes that share many common acoustic
properties exist in acoustic scene classification (ASC). To distinguish such
pairs of classes, trivial details scattered throughout the data could be vital
clues. However, these details are less noticeable and are easily removed using
conventional non-linear activations (e.g. ReLU). Furthermore, making design
choices to emphasize trivial details can easily lead to overfitting if the
system is not sufficiently generalized. In this study, based on the analysis of
the ASC task's characteristics, we investigate various methods to capture
discriminative information and simultaneously mitigate the overfitting problem.
We adopt the max feature map method to replace conventional non-linear
activations in a deep neural network, thereby applying an element-wise
comparison between different filters of a convolution layer's output. Two data
augmentation methods and two deep architecture modules are further explored to
reduce overfitting and sustain the system's discriminative power. Various
experiments are conducted using the Detection and Classification of Acoustic
Scenes and Events (DCASE) 2020 Task 1-A dataset to validate the proposed methods. Our
results show that the proposed system consistently outperforms the baseline,
where the single best-performing system achieves an accuracy of 70.4% compared to
65.1% for the baseline.
Comment: Submitted to DCASE2020 workshop
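The max feature map (MFM) replacement for ReLU described above is simple enough to sketch directly. This is a hypothetical NumPy rendering, not the paper's implementation: the channel dimension is split in half and the element-wise maximum of the two halves is taken, so the activation acts as a competitive selector between filters rather than a fixed threshold that can discard subtle details.

```python
import numpy as np

def max_feature_map(x):
    """Max feature map (MFM) activation over the channel axis.

    Splits the channels (axis 0) into two halves and returns their
    element-wise maximum, halving the channel count. Unlike ReLU,
    small-magnitude values survive as long as they win the pairwise
    comparison, which preserves trivial but discriminative details.
    """
    c = x.shape[0]
    assert c % 2 == 0, "channel count must be even"
    return np.maximum(x[: c // 2], x[c // 2:])

# 4 channels x 3 time steps -> 2 channels x 3 time steps
feat = np.array([[1.0, -2.0, 3.0],
                 [0.0,  5.0, -1.0],
                 [2.0,  1.0,  0.0],
                 [-3.0, 4.0,  7.0]])
out = max_feature_map(feat)
```

Note that negative values can pass through (e.g. if both competing filters are negative), which is exactly the property the abstract argues conventional ReLU-style activations lack.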